Speed up inlist for strings and primitives#813
Conversation
|
FYI @jhorstmann |
| }}; | ||
| } | ||
|
|
||
| fn values_in_list_utf8<OffsetSize: StringOffsetSizeTrait>( |
There was a problem hiding this comment.
This could move to Arrow sometime.
|
The test appear to be failing due to #818 |
alamb
left a comment
There was a problem hiding this comment.
I think this looks very cool -- nice work @Dandandan
I wonder if there is sufficient coverage of all these cases (e.g now there are different code paths for lists with/without nulls and negated/not negated that is different for primitive cases
| let values = $LIST_VALUES | ||
| .iter() | ||
| .flat_map(|expr| match expr { | ||
| ColumnarValue::Scalar(s) => match s { |
There was a problem hiding this comment.
Given this code just seems to pass along the values, wonder if it would be clearer / faster here to look for nulls using the ScalarValue::is_null(), perhaps like
let contains_nulls = $LIST_VALUES.iter().any(|expr| expr.is_null());And that way you could stop when you hit the first one as well as potentially avoid a collect into a new Vec
| }}; | ||
| } | ||
|
|
||
| fn values_in_list_primitive<T: ArrowPrimitiveType>( |
There was a problem hiding this comment.
I suggest adding some comments here noting these functions assume that there are no null values in the column
There was a problem hiding this comment.
I could for clarity. Note that the values slice can not contain nulls based on the type, but the array can have nulls (and should work for that case).
| }) | ||
| .collect::<BooleanArray>(), | ||
| ))) | ||
| if negated { |
There was a problem hiding this comment.
The same comment about finding NULL by checking for the first match applies to the code right above this as well
| ) | ||
| } | ||
| DataType::Int64 => { | ||
| make_contains!(array, list_values, self.negated, Int64, Int64Array) |
There was a problem hiding this comment.
Is there a reason not to apply the same transformation to Int64 and Int8?
There was a problem hiding this comment.
Oops, great catch.
| unimplemented!("InList does not support datatype {:?}.", datatype) | ||
| } | ||
| datatype => Result::Err(DataFusionError::NotImplemented(format!( | ||
| "InList does not support datatype {:?}.", |
Which issue does this PR close?
Closes #814
Rationale for this change
This gives a ~10-15% speed up for TCP-H query 12 (~24ms vs 28ms in memory) which spends quite some time in the code to execute
l_shipmode in ('MAIL', 'SHIP')For the micro bench on primitives (f32 values):
A large part comes from the use of the slower
BooleanArray::from_iterand some more branching inside the loop.This PR basically optimizes the easier cases when the list to compare to doesn't contain nulls.
What changes are included in this PR?
compare_op_scalarfrom arrow.Are there any user-facing changes?
Some code now returns
DataFusionError::NotImplementederror instead of runtime panic.